On Intra-page and Inter-page Semantic Analysis of Web Pages

Authors

  • Jun Wang
  • Jicheng Wang
  • Gangshan Wu
  • Hiroshi Tsuda
Abstract

To make real-world Web information more machine-processable, this paper presents a new approach to intra-page and inter-page semantic analysis of Web pages. The approach combines Web page structure analysis and semantic clustering for intra-page analysis with machine-learning-based link semantic analysis for inter-page analysis. Using automatic discovery of repetitive patterns at the structure level and clustering at the semantic level, we explore the intra-page semantic structure of Web pages and refine the processing unit from the whole page to a finer granularity, i.e., semantic information blocks within pages. After observing a variety of hyperlinks, we synthesize inter-page semantics and define a hyperlink semantic category scheme oriented toward information organization. Taking into account the presentation of the hyperlink carrier and the intra-page semantic structure, we propose corresponding feature selection and quantification methods, and then use the C4.5 decision-tree method to classify hyperlink semantic types and analyze the inter-page semantic structure. Experimental results suggest that our approach is feasible for machine processing.
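The abstract names C4.5 as the classifier for hyperlink semantic types. C4.5 chooses each split by gain ratio (information gain divided by split information), and the minimal sketch below computes that criterion for categorical features. The feature names (`carrier`, `block`), their values, and the class labels are hypothetical placeholders chosen for illustration; the abstract does not list the paper's actual features or categories.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def gain_ratio(rows, labels, feature):
    """C4.5 gain ratio for splitting on one categorical feature."""
    total = len(rows)
    # Group the class labels by the value each row has for this feature.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[feature], []).append(label)
    # Weighted entropy of the labels after the split.
    remainder = sum(len(p) / total * entropy(p) for p in partitions.values())
    info_gain = entropy(labels) - remainder
    # Split information penalizes features with many distinct values.
    split_info = entropy([row[feature] for row in rows])
    return info_gain / split_info if split_info > 0 else 0.0

# Hypothetical hyperlink samples: carrier presentation and enclosing block type.
rows = [
    {"carrier": "text", "block": "nav"},
    {"carrier": "image", "block": "nav"},
    {"carrier": "text", "block": "content"},
    {"carrier": "image", "block": "content"},
]
# Hypothetical hyperlink semantic types.
labels = ["navigation", "navigation", "reference", "reference"]

# The root split is the feature with the highest gain ratio.
best_feature = max(["carrier", "block"], key=lambda f: gain_ratio(rows, labels, f))
```

In this toy data the `block` feature separates the two classes completely, so it is selected as the root split. A full C4.5 implementation would also handle continuous features, missing values, and pruning, none of which are shown here.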

Similar resources

Intra/Inter-document Change Awareness for Co-authoring of Web Sites

Systems that support the co-authoring of web sites often allow users to freely edit pages. This can result in semantic inconsistencies within and between pages. We propose a change awareness mechanism that monitors intra- and inter-document edits, taking into account changes made to a page and pages connected to it through HTML or transclusion links. The effect of all the changes is computed base...

Prioritize the ordering of URL queue in Focused crawler

The enormous growth of the World Wide Web in recent years has made it necessary to perform resource discovery efficiently. For a crawler it is not a simple task to download the domain-specific web pages. This unfocused approach often shows undesired results. Therefore, several new ideas have been proposed; a key technique among them is focused crawling, which is able to crawl particular topical...

Hypertext Semantics for Web Applications

Web applications integrate dynamic pages for publishing data, functions for content management, and generic business services. To support the model-driven design and automatic generation of Web applications, an extended notion of hypertext is required, whose semantics is scarcely investigated. In this paper, we analyse and classify the semantic problems encountered in the computation of Web pag...

Web Page Classification Based on Uncorrelated Semi-Supervised Intra-View and Inter-View Manifold Discriminant Feature Extraction

Web page classification has attracted increasing research interest. It is intrinsically a multi-view and semi-supervised application, since web pages usually contain two or more types of data, such as text, hyperlinks and images, and unlabeled pages generally far outnumber labeled ones. Web page data is commonly high-dimensional. Thus, how to extract useful features from this kind of data ...

Web page classification based on a support vector machine using a weighted vote schema

Traditional information retrieval methods use keywords occurring in documents to determine the class of the documents, but they often retrieve unrelated web pages. To classify web pages effectively and address the synonymous keyword problem, we propose a web page classification method based on a support vector machine using a weighted vote schema for various features. The system uses both latent seman...

Publication date: 2003